
    The Unsupervised Acquisition of a Lexicon from Continuous Speech

    We present an unsupervised learning algorithm that acquires a natural-language lexicon from raw speech. The algorithm is based on the optimal encoding of symbol sequences in an MDL framework, and uses a hierarchical representation of language that overcomes many of the problems that have stymied previous grammar-induction procedures. The forward mapping from symbol sequences to the speech stream is modeled using features based on articulatory gestures. We present results on the acquisition of lexicons and language models from raw speech, text, and phonetic transcripts, and demonstrate that our algorithm compares very favorably to other reported results with respect to segmentation performance and statistical efficiency. (27-page technical report.)
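
    The MDL criterion the abstract relies on can be sketched in miniature. The corpus, the byte-cost model for the lexicon, and the greedy longest-match segmenter below are all illustrative assumptions, not the paper's actual algorithm: total cost is the bits needed to write down the lexicon plus the bits needed to encode the corpus as a sequence of lexicon entries, and a word-level lexicon beats a character-level one.

```python
import math

def description_length(corpus, lexicon):
    """MDL score: bits to state the lexicon (roughly one byte per character
    plus a delimiter) plus bits to encode the corpus as lexicon entries,
    each entry costing -log2 p(word) under its empirical frequency."""
    # Segment the corpus greedily by longest match against the lexicon.
    seg, i = [], 0
    while i < len(corpus):
        match = next(w for w in sorted(lexicon, key=len, reverse=True)
                     if corpus.startswith(w, i))
        seg.append(match)
        i += len(match)
    counts = {w: seg.count(w) for w in set(seg)}
    corpus_bits = sum(-math.log2(counts[w] / len(seg)) for w in seg)
    lexicon_bits = 8 * sum(len(w) + 1 for w in lexicon)
    return lexicon_bits + corpus_bits

corpus = "thedogthecatthedog"
chars = {"t", "h", "e", "d", "o", "g", "c", "a"}  # character-level lexicon
words = {"the", "dog", "cat"}                     # word-level lexicon
print(description_length(corpus, chars) > description_length(corpus, words))  # → True
```

    On this toy corpus the word lexicon pays more up front (longer entries) but encodes the data far more cheaply, which is the trade-off the MDL framework formalizes.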

    Unsupervised Language Acquisition

    This thesis presents a computational theory of unsupervised language acquisition, precisely defining procedures for learning language from ordinary spoken or written utterances, with no explicit help from a teacher. The theory is based heavily on concepts borrowed from machine learning and statistical estimation. In particular, learning takes place by fitting a stochastic, generative model of language to the evidence. Much of the thesis is devoted to explaining conditions that must hold for this general learning strategy to arrive at linguistically desirable grammars. The thesis introduces a variety of technical innovations, among them a common representation for evidence and grammars, and a learning strategy that separates the "content" of linguistic parameters from their representation. Algorithms based on it suffer from few of the search problems that have plagued other computational approaches to language acquisition. The theory has been tested on problems of learning vocabularies and grammars from unsegmented text and continuous speech, and mappings between sound and representations of meaning. It performs extremely well on various objective criteria, acquiring knowledge that causes it to assign almost exactly the same structure to utterances as humans do. This work has application to data compression, language modeling, speech recognition, machine translation, information retrieval, and other tasks that rely on either structural or stochastic descriptions of language. (PhD thesis, 133 pages.)
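
    One concrete instance of "fitting a stochastic, generative model of language to the evidence" is a unigram lexicon model over unsegmented text: given word probabilities, the best analysis of an utterance is the maximum-probability segmentation, found by dynamic programming. The lexicon and its probabilities below are invented for illustration; this is a sketch of the model class, not the thesis's learner.

```python
import math

def best_segmentation(utterance, lexicon):
    """Dynamic program: best[i] holds the highest log-probability of any
    segmentation of utterance[:i] into lexicon words; back[i] records
    where the last word of that segmentation starts."""
    n = len(utterance)
    best = [float("-inf")] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(i):
            w = utterance[j:i]
            if w in lexicon and best[j] + math.log(lexicon[w]) > best[i]:
                best[i] = best[j] + math.log(lexicon[w])
                back[i] = j
    # Walk the backpointers to recover the word sequence.
    words, i = [], n
    while i > 0:
        words.append(utterance[back[i]:i])
        i = back[i]
    return list(reversed(words))

lexicon = {"the": 0.4, "dog": 0.2, "do": 0.1, "g": 0.05, "barks": 0.25}
print(best_segmentation("thedogbarks", lexicon))  # → ['the', 'dog', 'barks']
```

    Note that "do" + "g" is also a legal analysis; the model prefers "dog" because its single-word probability exceeds the product of the two fragments, which is how a stochastic model arbitrates between competing structures.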

    Methods for Parallelizing Search Paths in Phrasing

    Many search problems are commonly solved with combinatoric algorithms that unnecessarily duplicate and serialize work at considerable computational expense. There are techniques available that can eliminate redundant computations and perform remaining operations concurrently, effectively reducing the branching factors of these algorithms. This thesis applies these techniques to the problem of parsing natural language. The result is an efficient programming language that can reduce some of the expense associated with principle-based parsing and other search problems. The language is used to implement various natural language parsers, and the improvements are compared to those that result from implementing more deterministic theories of language processing.
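
    The core idea of eliminating redundant computation in a combinatoric search can be shown with memoization: a naive recursive recognizer re-derives the same (symbol, span) subproblems exponentially often, while caching solves each once. The toy grammar and lexicon below are hypothetical, and this sketch illustrates the sharing technique rather than the thesis's programming language.

```python
from functools import lru_cache

# Toy CFG in Chomsky normal form (invented for illustration).
RULES = {"S": [("NP", "VP")], "NP": [("Det", "N")], "VP": [("V", "NP")]}
LEXICON = {"the": "Det", "dog": "N", "cat": "N", "saw": "V"}

@lru_cache(maxsize=None)
def derives(symbol, words):
    """True if `symbol` can derive the word tuple `words`. The lru_cache
    shares results across overlapping subproblems, so each (symbol, span)
    pair is computed once instead of once per enclosing analysis."""
    if len(words) == 1:
        return LEXICON.get(words[0]) == symbol
    return any(derives(left, words[:k]) and derives(right, words[k:])
               for left, right in RULES.get(symbol, ())
               for k in range(1, len(words)))

print(derives("S", ("the", "dog", "saw", "the", "cat")))  # → True
```

    This is the same sharing that chart parsers obtain with an explicit table; the memoized version makes the reduction in effective branching factor visible with almost no code.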

    On The Unsupervised Induction Of Phrase-Structure Grammars

    This paper examines why some previous approaches have failed to acquire desired grammars without supervision, and proposes that with a different conception of phrase structure, supervision might not be necessary. In particular, it describes in detail some reasons why SCFGs are poor models to use for learning human language, especially when combined with the inside-outside algorithm. Following up on these arguments, it proposes that head-driven grammatical formalisms like link grammars (Sleator and Temperley, 1991) are better suited to the task, and introduces a framework for CFG induction that sidesteps many of the search problems that previous schemes have had. In the end, we hope the analysis presented here convinces others to look carefully at their representations and search strategies before blindly applying them to the language learning task. We start the discussion by examining the differences between the linguistic and statistical motivations for phrase structure; this frames our subsequent analysis. Then we introduce a simple extension to stochastic context-free grammars, and use this new class of language models in two experiments that pinpoint specific problems with both SCFGs and the search strategies commonly applied to them. Finally, we explore fixes to these problems.
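
    For readers unfamiliar with the machinery under discussion: the inside pass of the inside-outside algorithm computes the probability that an SCFG generates a given sentence, by summing over all parses with a CKY-style chart. The grammar below is a tiny invented example in Chomsky normal form, shown only to make the object of the paper's criticism concrete.

```python
from collections import defaultdict

# Toy SCFG in Chomsky normal form: rule -> probability (illustrative only).
BINARY = {("S", "NP", "VP"): 1.0, ("NP", "Det", "N"): 1.0, ("VP", "V", "NP"): 1.0}
UNARY = {("Det", "the"): 1.0, ("N", "dog"): 0.5, ("N", "cat"): 0.5, ("V", "saw"): 1.0}

def inside_probability(words):
    """CKY-style inside pass: chart[i][j][A] = P(A derives words[i:j]),
    built bottom-up over increasing span lengths."""
    n = len(words)
    chart = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for (a, term), p in UNARY.items():
            if term == w:
                chart[i][i + 1][a] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (a, b, c), p in BINARY.items():
                    chart[i][j][a] += p * chart[i][k][b] * chart[k][j][c]
    return chart[0][n]["S"]

print(inside_probability(["the", "dog", "saw", "the", "cat"]))  # → 0.25
```

    The paper's point is that maximizing this quantity with inside-outside tends to reward bracketings that compress the data rather than bracketings linguists would accept, which is why the choice of representation matters so much.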

    Linguistic Structure as Composition and Perturbation

    This paper discusses the problem of learning language from unprocessed text and speech signals, concentrating on the problem of learning a lexicon. In particular, it argues for a representation of language in which linguistic parameters like words are built by perturbing a composition of existing parameters. The power of the representation is demonstrated by several examples in text segmentation and compression, acquisition of a lexicon from raw speech, and the acquisition of mappings between text and artificial representations of meaning.

    The Acquisition of a Lexicon from Paired Phoneme Sequences and Semantic Representations

    We present an algorithm that acquires words (pairings of phonological forms and semantic representations) from larger utterances of unsegmented phoneme sequences and semantic representations. The algorithm maintains from utterance to utterance only a single coherent dictionary, and learns in the presence of homonymy, synonymy, and noise. Test results over a corpus of utterances generated from the CHILDES database of mother-child interactions are presented.

    1 Introduction

    This paper is concerned with the machine learning of a lexicon from utterances that consist of an unsegmented phoneme sequence paired with a semantic representation of what those phonemes collectively mean. The problem is modeled after the environment that a child learns in, presented with a continuous speech signal and potentially hypothesizing a meaning for that signal based upon visual stimuli. We radically simplify the problem the child encounters for the computer by pre-digesting the speech stream into a sequence…
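
    A minimal cross-situational sketch conveys the shape of the task, though it is not the paper's dictionary-maintenance algorithm: pair every substring of each unsegmented "phoneme" string with every symbol in its meaning, and keep the longest substring that occurs exactly when its symbol does. The utterances and meaning symbols below are fabricated for illustration.

```python
from collections import defaultdict

def learn_pairings(examples, max_len=5):
    """Cross-situational counting: associate substrings of unsegmented
    utterances with meaning symbols, keeping the longest substring whose
    occurrences coincide exactly with its symbol's occurrences."""
    count = defaultdict(int)
    symbol_total = defaultdict(int)
    substr_total = defaultdict(int)
    for utterance, meaning in examples:
        subs = {utterance[i:j] for i in range(len(utterance))
                for j in range(i + 1, min(i + max_len, len(utterance)) + 1)}
        for s in subs:
            substr_total[s] += 1
            for m in meaning:
                count[(s, m)] += 1
        for m in meaning:
            symbol_total[m] += 1
    best = {}
    for (s, m), c in count.items():
        # Keep substrings that appear exactly when the symbol does.
        if c == symbol_total[m] == substr_total[s]:
            if m not in best or len(s) > len(best[m]):
                best[m] = s
    return best

examples = [("dogeatz", {"DOG", "EAT"}), ("dogranz", {"DOG", "RUN"}),
            ("kateatz", {"CAT", "EAT"}), ("katranz", {"CAT", "RUN"})]
print(sorted(learn_pairings(examples).items()))
# → [('CAT', 'kat'), ('DOG', 'dog'), ('EAT', 'eatz'), ('RUN', 'ranz')]
```

    The exact-coincidence test is far too brittle for real data with homonymy, synonymy, and noise, which is precisely the gap the paper's single-coherent-dictionary approach is designed to close.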

    Joseph Conrad's Letters to R. B. Cunninghame Graham [book].

    These letters display Conrad's friendship with an aristocratic Scottish adventurer who was a pioneering socialist. The elaborate annotations set the friendship in cultural and political contexts, and show how Graham influenced Conrad's writings.

    Parsing the LOB Corpus

    This paper presents a rapid and robust parsing system currently used to learn from large bodies of unedited text. The system contains a multivalued part-of-speech disambiguator and a novel parser employing bottom-up recognition to find the constituent phrases of larger structures that might be too difficult to analyze. The results of applying the disambiguator and parser to large sections of the Lancaster-Oslo/Bergen corpus are presented.
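
    A part-of-speech disambiguator of the general kind mentioned above can be sketched as a Viterbi search over each word's candidate tags under a tag-bigram model. The transition table, candidate sets, and example sentence are invented for illustration and are not the paper's system.

```python
import math

# Hypothetical tag-bigram probabilities and per-word candidate tags.
TRANS = {("<s>", "Det"): 0.8, ("<s>", "N"): 0.2,
         ("Det", "N"): 1.0, ("N", "V"): 0.6, ("N", "N"): 0.4,
         ("V", "Det"): 0.9, ("V", "N"): 0.1}
CANDIDATES = {"the": {"Det"}, "can": {"N", "V"}, "rust": {"N", "V"}}

def disambiguate(words):
    """Viterbi decoding: for each word, keep the best-scoring tag sequence
    ending in each candidate tag; unseen transitions get a tiny floor."""
    best = {"<s>": (0.0, [])}
    for w in words:
        new = {}
        for tag in CANDIDATES[w]:
            scored = [(lp + math.log(TRANS.get((prev, tag), 1e-9)), seq + [tag])
                      for prev, (lp, seq) in best.items()]
            new[tag] = max(scored)
        best = new
    return max(best.values())[1]

print(disambiguate(["the", "can", "rust"]))  # → ['Det', 'N', 'V']
```

    Both "can" and "rust" are tag-ambiguous in isolation; the bigram context resolves them jointly, which is the essential trick behind multivalued disambiguation.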